Setup

See this introduction to R before continuing the lesson below.

Note that this lesson is adapted from the Data Carpentry Ecology Lesson. You can find a lot more information there. Text that has been copied from this lesson is highlighted in blue.

#install.packages("tidyverse")
library(tidyverse)

Get the Data

The data used for this less is all about Australia, including its climate over time and recent fires. To learn more about Australia’s devastating fires check out this NY Times article. The data and information comes from Tidy Tuesday, a weekly data project aimed at the R ecosystem. You can find more datasets at https://github.com/rfordatascience/tidytuesday .

The data are available online so the first thing we’ll do is load them into R. To load data you’ll need to use a function.

Functions are “canned scripts” that automate more complicated sets of commands including operations assignments, etc. Many functions are predefined, or can be made available by importing R packages. A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function.

Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it. You should already have installed the tidyverse package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr, dplyr, ggplot2, tibble, etc.

We’ll read in our data using the read_csv() function, from the tidyverse package readr. We assign each dataset to a variable so we can reuse it. The name of the variable is on the left side of the arrow. The action we’re doing with the function is on the right side.

rainfall <- read_csv('https://tinyurl.com/Oz-fire-rain')
## Parsed with column specification:
## cols(
##   station_code = col_character(),
##   city_name = col_character(),
##   year = col_double(),
##   month = col_character(),
##   day = col_character(),
##   rainfall = col_double(),
##   period = col_double(),
##   quality = col_character(),
##   lat = col_double(),
##   long = col_double(),
##   station_name = col_character()
## )
temperature <- read_csv('https://tinyurl.com/Oz-fire-temp')
## Parsed with column specification:
## cols(
##   city_name = col_character(),
##   date = col_date(format = ""),
##   temperature = col_double(),
##   temp_type = col_character(),
##   site_name = col_character()
## )

You can see some information about the data we have just loaded. The name of each column is shown along with the type of data in that column. The data are stored in a format we call a data frame.

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by the functions such as read_csv(); in other words, when importing spreadsheets from your hard drive (or the web). A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors).

You will see the message Parsed with column specification, followed by each column name and its data type. When you execute read_csv on a data file, it looks through the first 1000 rows of each column and guesses the data type for each column as it reads it into R. For example, in this dataset, read_csv reads columns as col_double (a numeric data type), and as col_character. You have the option to specify the data type for a column manually by using the col_types argument in read_csv.

If you want to inspect your data frame there are several functions to do so. head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

For more details on this dataset see the Tidy Tuesday site.

This statement doesn’t produce any output because assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name:rainfall.

rainfall
## # A tibble: 179,273 x 11
##    station_code city_name  year month day   rainfall period quality   lat  long station_name       
##    <chr>        <chr>     <dbl> <chr> <chr>    <dbl>  <dbl> <chr>   <dbl> <dbl> <chr>              
##  1 009151       Perth      1967 01    01          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  2 009151       Perth      1967 01    02          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  3 009151       Perth      1967 01    03          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  4 009151       Perth      1967 01    04          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  5 009151       Perth      1967 01    05          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  6 009151       Perth      1967 01    06          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  7 009151       Perth      1967 01    07          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  8 009151       Perth      1967 01    08          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
##  9 009151       Perth      1967 01    09          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
## 10 009151       Perth      1967 01    10          NA     NA <NA>    -32.0  116. Subiaco Wastewater…
## # … with 179,263 more rows

You can also view the data in a separate window by clicking on its name in the Global Environment window. Here you can also see some information about the data. Note how much data we have! Click the arrow next to rainfall to view the columns and type of data in each column. For more information about this dataset (i.e. the metadata) see https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-07/readme.md

Note that we’ve read our data from a website, but you can read local files as well. These files are in csv format, which means plain text where columns are separated by commas. This is a very simple format that avoids all the complexity of Excel. If you need to read an Excel file you can either export as a csv or use a different function to read the data.

For more information on data frames see the Starting with Data section of the Data Carpentry lesson.

Basic data exploration

One handy way to look at your data is to get a summary table. Let’s do this for the temperature dataset.

summary(temperature)
##   city_name              date             temperature     temp_type          site_name        
##  Length:528278      Min.   :1910-01-01   Min.   :-11.5   Length:528278      Length:528278     
##  Class :character   1st Qu.:1940-09-06   1st Qu.: 11.1   Class :character   Class :character  
##  Mode  :character   Median :1967-10-04   Median : 16.3   Mode  :character   Mode  :character  
##                     Mean   :1966-11-09   Mean   : 16.6                                        
##                     3rd Qu.:1993-08-02   3rd Qu.: 21.6                                        
##                     Max.   :2019-05-31   Max.   : 48.3                                        
##                                          NA's   :3465

For a really basic exploratory analysis let’s look at how temperature is changing in Australian cities over time. I’ve summarized our original dataset so you can make your first plot more easily. First: load the summary data, which can be found at https://tinyurl.com/Oz-mean-temp.

#Challenge: load data
yearly_temp <- read_csv('https://tinyurl.com/Oz-mean-temp')
## Parsed with column specification:
## cols(
##   year = col_double(),
##   city_name = col_character(),
##   temperature = col_double()
## )

Now we’ll plot the temperature as a function of time.

Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

Use the ggplot() function and bind the plot to a specific data frame using the data argument

ggplot(data = yearly_temp)

Define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.

ggplot(yearly_temp, aes(year,temperature))

Add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:

  • geom_point() for scatter plots, dot plots, etc.
  • geom_boxplot() for, well, boxplots!
  • geom_line() for trend lines, time series, etc.

To add a geom to the plot use the + operator. Because we have two continuous variables, let’s use geom_point() first:

ggplot(yearly_temp, aes(year,temperature)) +
  geom_point()

Notice that you have a lot of data points for each year. This is because you have a data point on the mean temperature for each city. Let’s change our plot to show each city in a different color.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_point()

That’s a bit clearer. Let’s show the pattern over time by adding lines between the yearly points.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_point()+geom_line()

And let’s tidy this into a publication-quality plot.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_line() +
  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
  theme_bw()

In a few quick commands we can already plot temperature and observe how it’s been increasing.

Notes

- Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
- You can also specify mappings for a given geom independently of the mappings defined globally in the ggplot() function.
- The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.

Manipulating data

In the prior section I gave you a summary table of temperature data. Let’s consider how you could generate this summary table and do other data manipulation given our original datasets.

Data Manipulation using dplyr and tidyr

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:

  1. The results from a base R function sometimes depend on the type of data.
  2. Using R expressions in a non standard way, which can be confusing for new learners.
  3. Hidden arguments, having default operations that new learners are not aware of.

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is non-trivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about dplyr and tidyr, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

We’re going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

Selecting columns and filtering rows

To choose rows based on a specific criterion, use filter().

For our data let’s speculate that we may be seeing an increase in fires due to changes in maximum daily temperatures. After all, it could be spiking temperatures that allow fires to occur, and that wouldn’t be reflected well in the mean temperature for a day. In this case we need to filter our data to include only data where the column temp_type is listed as max.

temperatures_maxs <- filter(temperature,temp_type == "max")

Now consider what summary information you want. You probably want the average maximum temperature for each city for each year to look at changes over time. Let’s look at how we would calculate the average maximum temperature for one city for one year. Then we’ll be able to extend this to all cities and all years.

First filter your data to include only data from 2019 at “PERTH AIRPORT”. This is a challenge for you.

temperatures_maxs_PerthAir <- filter(temperatures_maxs,site_name == "PERTH AIRPORT")

You should have gotten stuck on how we know whether data came from 2019. That information is in the date column but you have to extract it. This is a great lesson. You first need to think about what you want your data to look like. Once you know what you want you can figure out how to communicate that to the computer.

In this case we’ll use the lubridate package to process date information.

Challenge: load the lubridate library.

library(lubridate)

You can use the year function to extract the year from the date column. Assign that to a new column in your data frame.

temperatures_maxs_PerthAir$year <- year(temperatures_maxs_PerthAir$date)

Now you can filter for data from 2019. This is a challenge.

temperatures_maxs_PerthAir_2019 <- filter(temperatures_maxs_PerthAir,year == 2019)

Now you can calculate the mean max temperature for this site in this year using the mean function. This is a challenge.

mean(temperatures_maxs_PerthAir_2019$temperature)
## [1] 28.5649

Lubridate is a great package for converting dates in various formats. For example you might have a date written in a column as 2012.03.26. You can let the computer know this is a date, and not just a string of characters, using lubridate’s ymd (which stands for year month day) Note that I have given the date as year-month-day and not day-month-year or month-day-year. If you label the rows in your dataset (or your data files) as year-month-day then when you sort them they’ll be in chronological order (rather than having all the January datapoints at the top of your spreadsheet or file list). However, if your dates are in another order you can use the dmy or mdy functions to tell R what the date is. Note that you can use any (or even no separator) between the year, month, and day.

x <- ymd("2012.03.26")
year(x)
## [1] 2012

Pipes

What if you want to do multiple filter steps at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

You probably took the intermediate step approach in your prior work.

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

temperatures_maxs$year <- year(temperatures_maxs$date)
temperatures_maxs_PerthAir_2019 <- filter(filter(temperatures_maxs,site_name == "PERTH AIRPORT"),year == 2019)

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr.

temperatures_maxs_PerthAir_2019 <- temperatures_maxs %>%
  filter(site_name == "PERTH AIRPORT") %>%
  filter(year == 2019)

Note that the data are sent from one function to the next so in subsequent functions you do not provide the name of the data. Make sure to put the pipe at the end of the line when using multiple lines so the processing doesn’t end prematurely.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame temperatures_max, then we filtered for rows with site_name == "PERTH AIRPORT", then we filtered for rows with year == 2019. Make sure to use the double equals sign when checking for equality. A single equals will assign data to a variable.

We can even go from the raw data to the mean all in one step.

temperature %>%
  filter(temp_type == "max") %>%
  filter(site_name == "PERTH AIRPORT")
## # A tibble: 39,963 x 5
##    city_name date       temperature temp_type site_name    
##    <chr>     <date>           <dbl> <chr>     <chr>        
##  1 PERTH     1910-01-01        26.7 max       PERTH AIRPORT
##  2 PERTH     1910-01-02        27   max       PERTH AIRPORT
##  3 PERTH     1910-01-03        27.5 max       PERTH AIRPORT
##  4 PERTH     1910-01-04        24   max       PERTH AIRPORT
##  5 PERTH     1910-01-05        24.8 max       PERTH AIRPORT
##  6 PERTH     1910-01-06        24.4 max       PERTH AIRPORT
##  7 PERTH     1910-01-07        25.3 max       PERTH AIRPORT
##  8 PERTH     1910-01-08        28   max       PERTH AIRPORT
##  9 PERTH     1910-01-09        32.6 max       PERTH AIRPORT
## 10 PERTH     1910-01-10        35.9 max       PERTH AIRPORT
## # … with 39,953 more rows

These are the first few steps but now we realize that we don’t have a year column in our original dataset. Accessing columns the way we did before doesn’t work great with pipe. Instead we’ll use the mutate function to add a year column. Similarly we realize that to calculate the mean we need to access the temperature column. We’ll use pull to pull out this column. Then we can pipe that data to the mean function.

temperature %>%
  filter(temp_type == "max") %>%
  filter(site_name == "PERTH AIRPORT") %>%
  mutate(year = year(date)) %>%
  filter(year == 2019) %>%
  pull(temperature) %>%
  mean()
## [1] 28.5649

Split-apply-combine data analysis and the summarize() function

Now we know how to calculate one example, but we would like the mean max temperature for all cities for all years. To get this information we’ll generate a summary table.

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.

In this case we are interested in one result for each city for each year. That means we’ll group by these variables. Then we’ll generate a summary table. In this table we want to calculate the mean of the temperature for each group.

mean_max_temp <- temperatures_maxs %>%
  group_by(city_name,year) %>%
  summarize(mean_max_temp = mean(temperature))

If you look at your results you’ll see some NA’s in your results. Because there were NAs in your original data, you are averaging real data with NA and producing NA. To avoid this, filter out cases where temperature is given as NA. We check for NAs using the is.na function and check for cases where it’s not NA using !is.na where the ! means not so !is.na means “is not NA”.

mean_max_temp <- temperatures_maxs %>%
  filter(!is.na(temperature)) %>%
  group_by(city_name,year) %>%
  summarize(mean_max_temp = mean(temperature))

Now you can plot mean_max_temp as a function of year. This is a challenge. Don’t forget to fix your axis labels.

ggplot(mean_max_temp, aes(year, mean_max_temp,color=city_name)) +
         geom_line()+
  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
  theme_bw()

You can see a couple of things about this plot that could be improved. First we could use more labels on the x axis so we can more easily see which year we’re looking at. We add the layer scale_x_continuous for continous data on the x axis. We specify where we want our axis to break using the breaks argument. We set this equal to the list of breaks we want using c for our list.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_line() +
  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
  theme_bw() +
  scale_x_continuous(breaks = c(1910, 1930, 1950, 1970, 1990, 2010))

Now consider that it’s more visually appealing to get the city labels to be in the same order as the lines. We want to make it easy to match each city to its position on the graph. To do this we need to reorder the cities based on their final data point (i.e. the temperature in 2019).

Take the yearly_temp data frame and filter it for 2019. Then arrange the data frame based on the values in the temperature column. Then pull out just the names of the cities (now in the correct order).

labels <- yearly_temp %>% filter(year == 2019) %>% 
  arrange(desc(temperature)) %>% pull(city_name)

Now we can use this list of cities as the ordering in our legend. We add a layer to our plot with the scale_color_discrete function (as we are looking at the colors and they are discrete rather than continuous data). We specify that both breaks (i.e. our legend colors) and labels (i.e. the city names are given by the labels variable.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_line() +
  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
  theme_bw() +
  scale_x_continuous(breaks = c(1910, 1930, 1950, 1970, 1990, 2010)) +
  scale_color_discrete(breaks = labels, labels = labels)

For more information on these visual tricks see Claus Wilke’s Data Visualization book.

Tidy data

Note that we have lots of data but it’s a bit hard to compare in the table. That’s because when we work with data in R we typically use “long form”. That means for everyone combination of variables (year and city), we have one observation (mean temp). Humans typically like to see data in “wide form”. That means you (or a colleague) probably collected data like this example.

temps_across_cities <- read_csv("https://tinyurl.com/Oz-temp-across-cities")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   temp_type = col_character(),
##   `PERTH AIRPORT` = col_double(),
##   `PORT LINCOLN AWS` = col_double(),
##   `KENT TOWN` = col_double(),
##   `BRISBANE AERO` = col_double(),
##   `SYDNEY (OBSERVATORY HILL)` = col_double(),
##   `CANBERRA AIRPORT` = col_double(),
##   `MELBOURNE (OLYMPIC PARK)` = col_double()
## )
head(temps_across_cities)
## # A tibble: 6 x 9
##   date       temp_type `PERTH AIRPORT` `PORT LINCOLN A… `KENT TOWN` `BRISBANE AERO`
##   <date>     <chr>               <dbl>            <dbl>       <dbl>           <dbl>
## 1 2019-05-31 max                  26.3             15.6        15.2            19.7
## 2 2019-05-31 min                  13.8              7.2         8.1             5.2
## 3 2019-05-30 max                  23.5             15.1        15.2            22.5
## 4 2019-05-30 min                  13.1              7.9         5               8.5
## 5 2019-05-29 max                  23.6             14.6        14              22.6
## 6 2019-05-29 min                  10.1             11.1        10.2             7.5
## # … with 3 more variables: `SYDNEY (OBSERVATORY HILL)` <dbl>, `CANBERRA AIRPORT` <dbl>, `MELBOURNE
## #   (OLYMPIC PARK)` <dbl>

Seeing the data like this makes comparisons much easier but it’s harder to plot or summarize in the way we’ve been doing. To work with the data the way we’ve been doing we need to have just one observation per row. We use the pivot_longer functions and specify the columns that include our observations. In this case it is the columns that are labeled with site names and contain temperature measurements. This means we will leave date and temp_type columns and make new columns labeled site and temperature. We then specify two column names: (1) the names that come from the column names will go into a column named site; (2) the second column will contain the temperature values.

temps_long <- temps_across_cities %>%
  pivot_longer("PERTH AIRPORT":"MELBOURNE (OLYMPIC PARK)", 
               names_to = "site", values_to = "temp")

This gives us four columns: date, temp_type, site, and temp, with one observation per row.

Challenge: try this with a rainfall dataset.

rain_across_cities <- read_csv("https://tinyurl.com/Oz-rain-across-cities")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   month = col_character(),
##   day = col_character(),
##   `Subiaco Wastewater Treatment Plant` = col_double(),
##   `North Adelaide` = col_double(),
##   `Greenslopes Private Hospital` = col_logical(),
##   Brisbane = col_double(),
##   `Observatory Hill` = col_double(),
##   `Canberra Airport` = col_double(),
##   `Melbourne Botanical Gardens` = col_double()
## )
## Warning: 2431 parsing failures.
##  row                          col           expected actual                                        file
## 1490 Greenslopes Private Hospital 1/0/T/F/TRUE/FALSE   110  'https://tinyurl.com/Oz-rain-across-cities'
## 1495 Greenslopes Private Hospital 1/0/T/F/TRUE/FALSE   66   'https://tinyurl.com/Oz-rain-across-cities'
## 1500 Greenslopes Private Hospital 1/0/T/F/TRUE/FALSE   24.4 'https://tinyurl.com/Oz-rain-across-cities'
## 1501 Greenslopes Private Hospital 1/0/T/F/TRUE/FALSE   5.4  'https://tinyurl.com/Oz-rain-across-cities'
## 1516 Greenslopes Private Hospital 1/0/T/F/TRUE/FALSE   5.4  'https://tinyurl.com/Oz-rain-across-cities'
## .... ............................ .................. ...... ...........................................
## See problems(...) for more details.

Before you pivot your data notice the errors when it read it. If you didn’t catch the errors when you read the data, take a look at the structure of your data frame by clicking on the arrow next to the variable in the Environment. Or use the structure function str.

str(rain_across_cities)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 59175 obs. of  10 variables:
##  $ year                              : num  2020 2020 2020 2020 2020 ...
##  $ month                             : chr  "01" "01" "01" "01" ...
##  $ day                               : chr  "01" "02" "03" "04" ...
##  $ Subiaco Wastewater Treatment Plant: num  0 0 NA NA NA NA 0 0 1.6 0 ...
##  $ North Adelaide                    : num  NA NA NA NA NA NA 0 0 0 0 ...
##  $ Greenslopes Private Hospital      : logi  NA NA NA NA NA NA ...
##  $ Brisbane                          : num  0.4 0 0 0 0 0 0 0 8.4 7 ...
##  $ Observatory Hill                  : num  0 0 0 0 0 NA 8.8 0 0 0 ...
##  $ Canberra Airport                  : num  0.2 0 0 0 0 NA 0 0 0 0 ...
##  $ Melbourne Botanical Gardens       : num  0 0 0 0 6 7.4 0 0 0 0 ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2431 obs. of  5 variables:
##   ..$ row     : int  1490 1495 1500 1501 1516 1518 1521 1549 1587 1591 ...
##   ..$ col     : chr  "Greenslopes Private Hospital" "Greenslopes Private Hospital" "Greenslopes Private Hospital" "Greenslopes Private Hospital" ...
##   ..$ expected: chr  "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
##   ..$ actual  : chr  "110" "66" "24.4" "5.4" ...
##   ..$ file    : chr  "'https://tinyurl.com/Oz-rain-across-cities'" "'https://tinyurl.com/Oz-rain-across-cities'" "'https://tinyurl.com/Oz-rain-across-cities'" "'https://tinyurl.com/Oz-rain-across-cities'" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   month = col_character(),
##   ..   day = col_character(),
##   ..   `Subiaco Wastewater Treatment Plant` = col_double(),
##   ..   `North Adelaide` = col_double(),
##   ..   `Greenslopes Private Hospital` = col_logical(),
##   ..   Brisbane = col_double(),
##   ..   `Observatory Hill` = col_double(),
##   ..   `Canberra Airport` = col_double(),
##   ..   `Melbourne Botanical Gardens` = col_double()
##   .. )

There are a couple of ways to deal with this, but we’ll fix it by having R check more lines before deciding on the column type.

rain_across_cities <- read_csv("https://tinyurl.com/Oz-rain-across-cities",
                               guess_max = 10000)
## Parsed with column specification:
## cols(
##   year = col_double(),
##   month = col_character(),
##   day = col_character(),
##   `Subiaco Wastewater Treatment Plant` = col_double(),
##   `North Adelaide` = col_double(),
##   `Greenslopes Private Hospital` = col_double(),
##   Brisbane = col_double(),
##   `Observatory Hill` = col_double(),
##   `Canberra Airport` = col_double(),
##   `Melbourne Botanical Gardens` = col_double()
## )

Now each column looks like it was read as the correct type. Now you can pivot your data.

rain_long <- rain_across_cities %>%
  pivot_longer(4:10, 
               names_to = "site", values_to = "temp")

If you want to share your summary table you might want to convert to wide format to make it more human readable. We use pivot_wider and specify that the new column names come from the current city_name column and the values for all of the new columns come from the mean_max_temp column.

mean_max_temp_wide <- mean_max_temp %>%
  pivot_wider(names_from = city_name, values_from = mean_max_temp)

Click on the data in the Environment to see it.

You can save your new human-readable data.

write_csv(mean_max_temp_wide, path = "data/mean_max_temp_wide.csv")

There are a lot of complex ways to rearrange columns using the pivot functions. For example, if we want to pivot our rainfall data into wide format, we realize that we have multiple pieces of information tied to one observation. When we pivot to make the column names as the location we will have multiple observations per row. You can think of the rows and columns as axes, and information like period is now an extra axis that doesn’t fit well. For example, we can only set the column names as either coming from station_code OR city_name OR station name. Thus, when we pivot we can specify which columns to keep by listing them.

rain_wide <-rainfall %>% pivot_wider(c(year,month,day),
                         names_from = "station_name", values_from = rainfall)

We can also aggregate data on the fly. Think about what you need to do to generate a data frame where each row contains the sum of rainfall for the year at each station. When we pivot the data we want to just leave the year. However, if we get values from the rainfall column we’ll end up with a list of values in each cell (i.e. for each year-station_name combinations there’s a list of values). Therefore, in addition to pivoting in this way we need to specify that we want to sum this list.

rainfall %>% filter(!is.na(rainfall)) %>% pivot_wider(year,
                         names_from = station_name, 
                         values_from = rainfall,
                         values_fn = list(rainfall = sum))
## # A tibble: 163 x 8
##     year `Subiaco Wastew… `North Adelaide` `Greenslopes Pr… Brisbane `Observatory Hi…
##    <dbl>            <dbl>            <dbl>            <dbl>    <dbl>            <dbl>
##  1  1967             358              284.              NA        NA            1342.
##  2  1968             830.             693.              NA        NA             625.
##  3  1969             589.             552.              NA        NA            1447.
##  4  1970             875.             493.              NA        NA            1104.
##  5  1971             716.             704.              NA        NA            1109.
##  6  1972             569.             456.             875.       NA            1282.
##  7  1973             859.             696.            1098        NA            1492.
##  8  1974             843.             638.            1639.       NA            1602.
##  9  1975             601.             526.             978.       NA            1293.
## 10  1976             589              372.            1199        NA            1783.
## # … with 153 more rows, and 2 more variables: `Canberra Airport` <dbl>, `Melbourne Botanical
## #   Gardens` <dbl>

To get around needing multiple pieces of information per observation, two variables are sometimes stored in the column name of a wide dataset. You can see why we prefer long data. This dataset has both city name and station code in each column name. This allows for observations in different stations in each city. However if we want to get data from all stations in the city we need these two pieces of information separately.

rain_across_stations <- read_csv("https://tinyurl.com/Oz-rain-across-stations",
                               guess_max = 10000)
## Parsed with column specification:
## cols(
##   year = col_double(),
##   month = col_character(),
##   day = col_character(),
##   Perth_009151 = col_double(),
##   Adelaide_023011 = col_double(),
##   Brisbane_040383 = col_double(),
##   Brisbane_040913 = col_double(),
##   Sydney_066062 = col_double(),
##   Canberra_070351 = col_double(),
##   Melbourne_086232 = col_double()
## )
rain_across_stations %>% pivot_longer(
  cols = Perth_009151:Melbourne_086232,
  names_to = c("city", "station"), 
  names_pattern = "(.*)_(.*)",
  values_to = "rainfall"
)
## # A tibble: 414,225 x 6
##     year month day   city      station rainfall
##    <dbl> <chr> <chr> <chr>     <chr>      <dbl>
##  1  2020 01    01    Perth     009151       0  
##  2  2020 01    01    Adelaide  023011      NA  
##  3  2020 01    01    Brisbane  040383      NA  
##  4  2020 01    01    Brisbane  040913       0.4
##  5  2020 01    01    Sydney    066062       0  
##  6  2020 01    01    Canberra  070351       0.2
##  7  2020 01    01    Melbourne 086232       0  
##  8  2020 01    02    Perth     009151       0  
##  9  2020 01    02    Adelaide  023011      NA  
## 10  2020 01    02    Brisbane  040383      NA  
## # … with 414,215 more rows

This is a bit more elaborate than before. First we specify the range of columns to rearrange (this is the same as before). This time we want to break each column name in two. We now specify a list of columns in the names_to argument. Now we need to specify how to break the column name. We break the name at the underscore - you can see that in the middle. On each side of the underscore we specify the pattern of the name. In this case it can be any number of any character. We specify the latter with . and follow that with the * to specify multiple.

For more complex pivoting look at the Pivoting vignette on the tidyr page of the tidyverse website.

More data manipulation

Another analysis we might be interested in is whether temperatures for individual months are increasing.

Challenge: Examine how the temperature has increased over time in Canberra. Filter for data from Canberra. Get the average max temperature in each month for each year. Pivot year to be viewable.

First extract the month from the date information. Then filter for just maximum temperatures in Canberra. Then generate a summary table including the average temperature for each month-year combination. Then pivot this data so that each month is a row and you can view changes in temperature over the years by scrolling through columns.

monthly_temps_CAN <- temperature %>%
  filter(!is.na(temperature)) %>%
  mutate(year = year(date), month = month(date)) %>%
  filter(city_name == "CANBERRA") %>%
  filter(temp_type == "max") %>%
  group_by(month, year) %>%
  summarize(avg_temp = mean(temperature)) %>%
  pivot_wider(names_from = year, values_from = avg_temp)
head(monthly_temps_CAN)
## # A tibble: 6 x 107
## # Groups:   month [6]
##   month `1914` `1915` `1916` `1917` `1918` `1919` `1920` `1921` `1924` `1925` `1926` `1927` `1928`
##   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1   29.1   27.1   28.8   27.0   23.9   30.1   23.3   25.4   25.1   24.1   26.6   27.1   25.0
## 2     2   29.7   30.0   26.9   22.8   24.6   29.4   27.3   25.8   25.2   25.2   31.0   27.2   27.2
## 3     3   25.6   26.2   23.7   23.3   23.4   24.2   22.4   21.8   24.6   24.8   24.3   25.1   24.9
## 4     4   19.7   19.9   18.2   18.1   18.7   22.2   18.9   19.1   17.7   21.2   18.8   20.4   19.3
## 5     5   15.3   13.9   15.8   12.9   15.7   16.1   15.5   15.4   15.8   15.9   12.6   14.4   14.3
## 6     6   13.0   11.5   10.2   11.5   11.2   13.0   11.2   12.8   10.5   13.5   12.6   10.6   11.8
## # … with 93 more variables: `1929` <dbl>, `1930` <dbl>, `1931` <dbl>, `1932` <dbl>, `1933` <dbl>,
## #   `1934` <dbl>, `1935` <dbl>, `1936` <dbl>, `1937` <dbl>, `1938` <dbl>, `1939` <dbl>,
## #   `1940` <dbl>, `1941` <dbl>, `1942` <dbl>, `1943` <dbl>, `1944` <dbl>, `1945` <dbl>,
## #   `1946` <dbl>, `1947` <dbl>, `1948` <dbl>, `1949` <dbl>, `1950` <dbl>, `1951` <dbl>,
## #   `1952` <dbl>, `1953` <dbl>, `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>,
## #   `1958` <dbl>, `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>,
## #   `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
## #   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
## #   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## #   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
## #   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## #   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## #   `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
## #   `2018` <dbl>, `2019` <dbl>, `1913` <dbl>, `1923` <dbl>

Now we can summarize by month and plot.

monthly_temps <- temperature %>%
  filter(!is.na(temperature)) %>%
  mutate(year = year(date), month = month(date)) %>%
  group_by(month, year) %>%
  summarize(avg_temp = mean(temperature)) 

ggplot(monthly_temps, aes(month, avg_temp, group=year, color=year)) +
  geom_line() + geom_point()

Did you notice that the color is along a scale? For R, year is a continuous variable so the scale is continuous. This probably isn’t what you want. You might view year as a factor to separate the years distincly.

ggplot(monthly_temps, aes(month, avg_temp, group=year, color=factor(year))) +
  geom_line() + geom_point()

On the other hand, although you can see lighter colors mostly at the top, all those data are really messy anyway. There are really too many categories for a discrete variable. Let’s try grouping the information by before and after 1964 to make things simpler. We have already seen mutate to add a column. This time we want to add a column that designates whether the date is before or after 1964 using 0 or 1. We use the ifelse function to do this. Then we take this updated dataframe and group by both month and time period and summarize the average temperature for each group.

monthly_temps_by_period <- temperature %>%
  filter(!is.na(temperature)) %>%
  mutate(year = year(date), month = month(date)) %>%
  mutate(time_period = ifelse( year <= 1964, 0, 1)) %>%
  group_by(month, time_period) %>%
  summarize(avg_temp = mean(temperature)) 

Now we can plot the result. The x-axis is month (notice we’ve added the month function so this looks like a date). The y-axis is the temperature. And we have separate lines for our two time periods.

ggplot(monthly_temps_by_period, aes(month(month, label = T), avg_temp, 
                                    group=time_period, col=factor(time_period))) +
  geom_line() + geom_point() +
  scale_color_discrete(name = "", labels = c("Before 1964", "After 1964")) +
  labs(y = "Average Temperature (°C)",
       x = "Month") 

What do you observe about how monthly temperatures differ?

Save your plot!

ggsave('avg_monthly_temps.png')

We can also look at cities separately. Notice that I’ve redone my grouping for my summary table to include city. Notice the facet_wrap function here to automatically create separate plots.

monthly_temps_by_period <- temperature %>%
  filter(!is.na(temperature)) %>%
  mutate(year = year(date), month = month(date)) %>%
  mutate(time_period = ifelse( year <= 1964, 0, 1)) %>%
  group_by(month, time_period, city_name) %>%
  summarize(avg_temp = mean(temperature, na.rm = T)) 

ggplot(monthly_temps_by_period, aes(month(month, label = T), avg_temp, 
                                    group=time_period, col=factor(time_period))) +
  geom_line() + geom_point() + facet_wrap(~city_name) +
  scale_color_discrete(name = "", labels = c("Before 1964", "After 1964")) +
  labs(y = "Average Temperature (°C)",
       x = "Month") 

Joining data together

We have separate datasets for rainfall and temperature. What if we want to know the relationship between these variables. We need to get these data into a single dataframe. First think about how the data line up. We need to know that KENT (in the temperature data frame) is the same location as Adelaide (in the rainfall dataframe).

unique(temperature$city_name)
## [1] "PERTH"     "PORT"      "KENT"      "BRISBANE"  "SYDNEY"    "CANBERRA"  "MELBOURNE"
unique(rainfall$city_name)
## [1] "Perth"     "Adelaide"  "Brisbane"  "Sydney"    "Canberra"  "Melbourne"
temperature <- temperature %>% 
  mutate(city = recode(city_name,
                       PERTH = "Perth",
                       PORT = "NA",
                       KENT = "Adelaide",
                       BRISBANE = "Brisbane",
                       SYDNEY = "Sydney",
                       CANBERRA = "Canberra",
                       MELBOURNE = "Melbourne"
  ))

#alternative method
city_convert <- c("Perth","X","Adelaide","Brisbane","Sydney", "Canberra", "Melbourne")
city_convert <- append(unique(rainfall$city_name), "X", after = 1)
names(city_convert) <- unique(temperature$city_name)
temperature <- temperature %>% 
  mutate(city = recode(city_name,!!!city_convert))

We need to think about how our dates don’t match (yet). We also have two measurements for rain in Brisbane so we should probably average those. This will give us six cities and one measurement per city.

rain2 <- rainfall %>%
  filter(!is.na(rainfall)) %>%
  mutate(date = ymd(paste(year,month,day))) %>%
  select(city_name,date,rainfall) %>%
  group_by(city_name,date) %>%
  summarize(rainfall = mean(rainfall))

Now we can join the datasets together. For any given date in any given city we now have temperature and rainfall side-by-side.

rain_plus_temp <- left_join(rain2,
                            filter(
                              select(temperature,date,temperature,temp_type,city),
                              temp_type == "max"),
          by=c("date",c("city_name" = "city")))

We can look for seasonality

rain_plus_temp <- mutate(rain_plus_temp, month = month(date),
                         season = case_when(month > 2 & month < 6 ~ "fall",
                                            month > 5 & month < 9 ~ "winter",
                                            month > 8 & month < 12 ~ "spring",
                                            month == 12 | month < 3 ~ "summer"))

ggplot(rain_plus_temp,aes(temperature,rainfall)) +
  geom_point() + facet_wrap(~season)
## Warning: Removed 29539 rows containing missing values (geom_point).

Functions

Up to this point we have been using functions created by other people. Having a function allows us to skip manual work or copy-paste for things we do repeatedly. For example we prefer using the mean function to calculating the sum of values and dividing by the number of observations. We can write our own functions to repeat particular types of work and to ensure that when we repeat work it is repeated in exactly the same way.

Let’s say we want to compare fires occuring in Australia with fires in California. Perhaps the data collected in California on temperature is in degrees Fahrenheit. We could spell out our conversion factor every time we want to convert F to C, or we can define a function to make this calculation and call the function as needed.

We define fahrenheit_to_celsius by assigning it to the output of function. The list of argument names are contained within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}). The statements in the body are indented by two spaces, which makes the code easier to read but does not affect how the code operates. When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

For more on functions see Software Carpentry lesson and another lesson.

fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

Notice that the function contains its own variables. These are defined purely within the function and cannot be called outside the function.

Before we use this function let’s test it on a couple of examples so we know that it works. It it good practice to test your function on known cases as well as the strangest or most difficult cases you can think of.

# freezing point of water
fahrenheit_to_celsius(32)
## [1] 0
# boiling point of water
fahrenheit_to_celsius(212)
# a list of values
fahrenheit_to_celsius(c(32, 212))
## [1]   0 100
# this shouldn't work
#fahrenheit_to_celsius(q)

Since you can’t give the function anything other than a number let’s make sure you get a nice error message.

fahrenheit_to_celsius <- function(temp_F) {
  if (!is.numeric(temp_F)) {
    stop("You must provide a number or numeric vector.")
  }
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

#fahrenheit_to_celsius(q)

Challenge: Write a function to convert C to F and test it.

Now we can apply this function to our data.

tempCtoF <- function(temp_C) {
  (temp_C * 9/5) + 32
}

temperature <- mutate(temperature,tempF = tempCtoF(temperature))

Challenge: write a function to convert mm to inches so we can compare rainfall in California and Australia

mm_to_inches <- function(mm){
  #divide the length value by 25.4
  mm/25.4
}

When you have functions you plan to reuse, I recommend putting them in a separate R script and then using source to import those functions for use in your analysis script.